Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications
نویسندگان
چکیده
The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to study the propagation of longitudinal and transversal waves in a stratified media. The potential of the scheme and the relevance of each acceleration strategy for massively computations in FDTD are demonstrated in this work. In this paper, we propose two new specific implementations of the bidimensional scheme of the FDTD method using multi-CPU and multi-GPU, respectively. In the first implementation, an open source message passing interface (OMPI) has been included in order to massively exploit the resources of a biprocessor station with two Intel Xeon processors. Moreover, regarding CPU code version, the streaming SIMD extensions (SSE) and also the advanced vectorial extensions (AVX) have been included with shared memory approaches that take advantage of the multi-core platforms. On the other hand, the second implementation called the multi-GPU code version is based on Peer-to-Peer communications available in CUDAon twoGPUs (NVIDIAGTX670). Subsequently, this paper presents an accurate analysis of the influence of the different code versions including shared memory approaches, vector instructions and multi-processors (both CPU and GPU) and compares them in order to delimit the degree of improvement of using distributed solutions based on multi-CPU and multi-GPU. The performance of both approacheswas analysed and it has been demonstrated that the addition of sharedmemory schemes to CPU computing improves substantially the performance of vector instructions enlarging the simulation sizes that use efficiently the cache memory of CPUs. In this case GPU computing is slightly twice times faster than the fine tuned CPU version in both cases one and two nodes. However, for massively computations explicit vector instructions do not worth it since the memory bandwidth is the limiting factor and the performance tends to be the same than the sequential version with auto-vectorisation and also shared memory approach. In this scenario GPU computing is the best option since it provides a homogeneous behaviour. More specifically, the speedup of GPU computing achieves an upper limit of 12 for both one and two GPUs, whereas the performance reaches peak values of 80 GFlops and 146 GFlops for the performance for one GPU and two GPUs respectively. Finally, the method is applied to an earth crust profile in order to demonstrate the potential of our approach and the necessity of applying acceleration strategies in these type of applications. © 2015 Elsevier B.V. All rights reserved.
منابع مشابه
Implementation of Fdtd-compatible Green’s Function on Heterogeneous Cpu-gpu Paral- Lel Processing System
This paper presents an implementation of the FDTDcompatible Green’s function on a heterogeneous parallel processing system. The developed implementation simultaneously utilizes computational power of the central processing unit (CPU) and the graphics processing unit (GPU) to the computational tasks best suited for each architecture. Recently, closed-form expression for this discrete Green’s fun...
متن کاملGpu Accelerated Parallel Branch Prediction for Multi/many-core Processor Simulation
Branch Prediction is a common function in nowadays microprocessors. Branch predictor is duplicated in each core of a multi/many-core processor and makes prediction for multiple concurrent running programs respectively. To evaluate the parallel branch prediction in a multi/many-core processor, existing schemes generally use a parallel simulator running on a CPU that does not have a real massive ...
متن کاملA Novel Scheme for High Performance Finite-Difference Time-Domain (FDTD) Computations Based on GPU
Finite-Difference Time-Domain (FDTD) has been proved to be a very useful computational electromagnetic algorithm. However, the scheme based on traditional general purpose processors can be computationally prohibitive and require thousands of CPU hours, which hinders the large-scale application of FDTD. With rapid progress on GPU hardware capability and its programmability, we propose in this pa...
متن کاملComputation-Communication Overlap of Linpack on a GPU-Accelerated PC Cluster
In this paper, we propose an approach to obtaining enhanced performance of the Linpack benchmark on a GPU-accelerated PC cluster connected via relatively slow inter-node connections. For one node with a quad-core Intel Xeon W3520 processor and a NVIDIA Tesla C1060 GPU card, we implement a CPU–GPU parallel double-precision general matrix–matrix multiplication (dgemm) operation, and achieve a per...
متن کاملGPU accelerated FDTD solver and its application in B1-shimming
Introduction FDTD is widely used for EM analyses in MRI applications, due to its simplicity and efficiency in wave modelling and its ability to handle field-tissue interactions. However, contemporary MRI problems, such as high field-tissue interaction modelling requires high performance computing. Conventional CPU-based FDTD calculations offer compromised computing performance without expensive...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- Computer Physics Communications
دوره 191 شماره
صفحات -
تاریخ انتشار 2015